All posts by David Eison

Arke Systems Blog

Useful technical and business information straight from Arke.


The law of leaky abstractions and Reddit’s experience with the cloud

Reddit had a 6-hour outage caused by running their database on Amazon's cloud disk storage product, EBS (Elastic Block Store). EBS is both unreliable and does not flush writes to disk when told to.  This led to corrupted data and disagreements between the master database and the slave databases, so the slaves could not be used while the master was down.

Modern SQL databases are not written to work correctly on hardware that lacks a reliable flush-to-disk operation. Correct functioning of a modern database absolutely requires that "write back" caching can be shut off, so that if a disk reports a write succeeded, the write actually succeeded.  See for example http://www.postgresql.org/docs/current/static/wal-reliability.html; this is not at all unique to Postgres, and I believe every SQL database requires committed writes to actually reach the disk.
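
To make the "write this to disk" layering concrete at the application level, here is a small illustrative C# sketch (not Reddit's stack, just an analogy): a plain Flush() only drains buffers down to the operating system, while .NET 4's FileStream.Flush(true) asks the OS to push its buffers to the physical disk, and even that promise depends on every layer below honoring it.

using System;
using System.IO;

class FlushExample
{
    static void Main()
    {
        using (var stream = new FileStream("journal.log", FileMode.Append,
                                           FileAccess.Write, FileShare.Read))
        using (var writer = new StreamWriter(stream))
        {
            writer.WriteLine("commit 770");
            writer.Flush();      // drains .NET buffers to the OS; data may still sit in caches
            stream.Flush(true);  // asks the OS to flush its buffers to the physical disk
            // ...but if a layer underneath (RAID controller, SAN, or a virtual disk like EBS)
            // acknowledges the flush without actually persisting it, the database above gets lied to.
        }
    }
}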

The law of leaky abstractions says that as we virtualize more, we can do things that appear to work but don't actually work quite the way we think they do, and Murphy's law says the difference will eventually hit us as public downtime.

Row ID #770 - Bob submits an article about puppies.  The master says it committed, so the data is sent to the slaves.  The master lied: the data was actually sitting in a cache somewhere, and the write later fails on the master - but it succeeded on the slaves.

Row ID #770 - John submits an article about kittens.  The master now has kittens, the slaves have puppies, dogs and cats living together, mass hysteria, and Reddit is down for 6 hours migrating master data to new hardware and manually hacking up rebuilt slave tables.

I don't know how this works with NoSQL systems and eventual consistency.  Is Cassandra OK to run on EBS disks but Postgres not at all?  Reddit says their solution is to move to local EC2 disks; is that actually a solution, or does it just make hitting the problem less likely because EC2 local disks are more reliable than EBS?  Do they still do write-back caching?

Meanwhile, Netflix has pointed out that they have moved most of their functionality into the cloud.  Specifically, almost everything that scales with customers and streaming usage is now served from the cloud (although the movies themselves come from CDNs, not Amazon's EC2).

Netflix has posted some really interesting information about the testing they did on EBS: http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html

And their lessons-learned post is a great place to start when considering working in the cloud at scale: Netflix’s “5 Lessons We’ve Learned Using AWS”.

The upshot is that scaling by working in the cloud brings a whole new set of challenges.  You have to invest more in writing your software to handle hardware failure, you have to test failure scenarios more, you may have to go so far as to redesign network protocols to be less chatty because shared systems give you unpredictable latency, and you have to expect problems when abstractions leak as layers of complexity are added to what used to be a simple operation like “write this to disk”.  If you’re at the point where your hardware costs from scaling exceed your software development costs, or if you truly need to handle rapid customer growth faster than you can expand traditional data center use, it can make a lot of sense to tackle these challenges.  But it’s not a no-brainer, no-effort proposition – development and testing get harder as you switch to a larger quantity of less reliable resources and have to handle new failure scenarios.

Reddit's explanation:

http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html

Netflix uses SimpleDB, Hadoop, and Cassandra.

http://nosql.mypopescu.com/post/2981945438/why-netflix-picked-amazon-simpledb-hadoop-hbase-and

http://techblog.netflix.com/2011/01/nosql-at-netflix.html

-David


Posted by David Eison on Sunday, March 20, 2011 2:16 PM

DNN Event Viewer times out

Continuing in our theme of adjusting DNN sprocs…

The DNN Event Viewer first runs a purge sproc, then runs a get log sproc.

If you have a lot of events, the purge sproc is almost certain to time out. 

The get-log sproc can benefit from a NOLOCK hint, but the purge sproc is where we saw most of our trouble.

The purge sproc in DNN 5.5.1:

    ;WITH logcounts AS
    (  
      SELECT 
        LogEventID, 
        LogConfigID, 
        ROW_NUMBER() OVER(PARTITION BY LogConfigID ORDER BY LogCreateDate DESC) AS logEventSequence
      FROM dbo.EventLog with(NOLOCK)
    )
    DELETE dbo.EventLog 
    FROM dbo.EventLog el 
        JOIN logcounts lc ON el.LogEventID = lc.LogEventID
        INNER JOIN dbo.EventLogConfig elc ON elc.ID = lc.LogConfigID
    WHERE elc.KeepMostRecent <> -1
        AND lc.logEventSequence > elc.KeepMostRecent 

 

This was failing for a few clients of ours.  They would end up with say 65k records in the event log table, and this would never complete. 

The replacement below should lock fewer rows, delete in 1000-record chunks, and put an upper bound on how many records it will tackle at once.  This helped the purge quit failing for our client:

ALTER PROCEDURE [dbo].[PurgeEventLog]
AS
SET NOCOUNT ON
SET DEADLOCK_PRIORITY LOW

create table #TLog (LogGUID uniqueidentifier not null primary key, LogCreateDate datetime)

;WITH logcounts AS
(  
  SELECT 
    LogEventID, 
    LogConfigID, 
    ROW_NUMBER() OVER(PARTITION BY LogConfigID ORDER BY LogCreateDate DESC) AS logEventSequence
  FROM dbo.EventLog with(NOLOCK)
)
insert into #TLog
 SELECT LogGUID, LogCreateDate
    FROM dbo.EventLog el with(NOLOCK)
        JOIN logcounts lc with(NOLOCK) ON el.LogEventID = lc.LogEventID
        INNER JOIN dbo.EventLogConfig elc with(NOLOCK) ON elc.ID = lc.LogConfigID
    WHERE elc.KeepMostRecent <> -1
        AND lc.logEventSequence > elc.KeepMostRecent 

declare @intRowCount int
declare @intErrNo int
declare @commiteveryn int
declare @maxloops int

set @commiteveryn=1000
set @intErrNo=0
set @intRowCount=1 -- force first loop
set @maxloops=20

WHILE @intRowCount > 0 and @maxloops > 0
    BEGIN
        set @maxloops = @maxloops - 1
        BEGIN TRANSACTION
        BEGIN TRY
        DELETE FROM EventLog WHERE LogGuid IN (select top (@commiteveryn) LogGUID from #TLog order by LogCreateDate DESC)
        SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT        
        DELETE FROM #TLog WHERE LogGuid IN (select top (@commiteveryn) LogGUID from #TLog order by LogCreateDate DESC)

        commit
        END TRY
        BEGIN CATCH
         rollback;
         set @maxloops=0
        END CATCH
    END

drop table #TLog

-- used to be
    --;WITH logcounts AS
    --(  
    --  SELECT 
    --    LogEventID, 
    --    LogConfigID, 
    --    ROW_NUMBER() OVER(PARTITION BY LogConfigID ORDER BY LogCreateDate DESC) AS logEventSequence
    --  FROM dbo.EventLog with(NOLOCK)
    --)
    --DELETE dbo.EventLog 
    --FROM dbo.EventLog el 
    --    JOIN logcounts lc ON el.LogEventID = lc.LogEventID
    --    INNER JOIN dbo.EventLogConfig elc ON elc.ID = lc.LogConfigID
    --WHERE elc.KeepMostRecent <> -1
    --    AND lc.logEventSequence > elc.KeepMostRecent 




GO



Posted by David Eison on Monday, February 28, 2011 2:26 PM

More DNN performance

Some DNN sites spend way too much time running the sproc dbo.GetSchedule.  This is probably worse if DNN is configured with its Scheduled Jobs in the default ‘Request’ mode (instead of ‘Timer’ mode).  Unfortunately, the query is both slow and prone to deadlocking with updates.

The original sproc in our DNN 5.6.1 install does:

    SELECT
        S.*,
        SH.NextStart
    FROM dbo.Schedule S
        LEFT JOIN dbo.ScheduleHistory SH ON S.ScheduleID = SH.ScheduleID
    WHERE (SH.ScheduleHistoryID = (SELECT TOP 1 S1.ScheduleHistoryID
                                        FROM dbo.ScheduleHistory S1
                                        WHERE S1.ScheduleID = S.ScheduleID
                                        ORDER BY S1.NextStart DESC)
                OR SH.ScheduleHistoryID IS NULL)
            AND (@Server IS NULL OR S.Servers LIKE '%,' + @Server + ',%' OR S.Servers IS NULL)

Here’s almost the same thing, but faster and less likely to deadlock:

    SELECT
        S.*,
        (SELECT TOP 1 NextStart FROM ScheduleHistory S1 with(nolock)
         WHERE S1.ScheduleID = S.ScheduleID
         ORDER BY S1.NextStart DESC) as NextStart
    FROM dbo.Schedule S  with(nolock)
    WHERE (@Server IS NULL OR S.Servers LIKE '%,' + @Server + ',%' OR S.Servers IS NULL)

Replacing this one query dropped one problematic DNN site from 100% SQL Server CPU utilization to more like 30%.


Posted by David Eison on Thursday, February 17, 2011 1:33 AM

Digging through Event logs

I haven’t found a tool I love for parsing Event Viewer logs.

What I do these days is use psloglist from Windows Sysinternals to dump the log to a tab-delimited file, then hack on the strings in Excel using IFERROR, SEARCH, RIGHT, and LEFT to get decently representative strings, and then sort and subtotal.

Also today I needed to read a DNN error table while DNN was not behaving well.  DNN writes XML to the database; the easiest thing to do is to cast it to XML with

cast(LogProperties as XML)

Plug it into a temp table, then process the XML using SQL Server’s XML query support:

select cast(LogProperties as XML) as props, LogTypeKey, LogGUID, LogCreateDate into #tmptable from EventLog with(nolock)
WHERE LogTypeKey='GENERAL_EXCEPTION'

select count(*) as cnt,a.msg
from
(select cast(props.query('LogProperties/LogProperty/PropertyName[text()="Message"]/../PropertyValue/text()') as nvarchar(MAX)) as msg
  from #tmptable) a
group by a.msg
order by COUNT(*) desc

select cast(props.query('LogProperties/LogProperty/PropertyName[text()="Message"]/../PropertyValue/text()') as nvarchar(MAX)) as msg, *
  into #tmp2
  from #tmptable

select * from #tmp2 where msg like '%[interesting keyword]%' order by LogCreateDate desc

I test out XPath using XPath Visualizer.


Posted by David Eison on Monday, February 7, 2011 7:12 PM

HTTP: PUT vs POST

PUT doesn’t come up much in plain HTML work.  Pages and forms pretend that everything is a GET or a POST.  But once you are working with web services, your choices are:

SOAP – send everything as an HTTP POST.  HTTP is being used mainly to dodge firewalls, not to provide any real benefit.  The protocol was intentionally designed to be too complex for humans, leading to dependence on tools.

REST – use verbs applied to nouns.  HTTP has other useful verbs besides GET and POST. 

Unfortunately, PUT and POST are conceptually similar, so they are easy to confuse.  I thought I’d mention the two key differences here in one spot:

a) PUT is idempotent.  Repeating the same operation should get the same result.  Side effects are allowed, but the side effects should be the same for repeated requests.  POST is not idempotent.  Repeating the same operation may yield different results.

b) PUT contains the actual item for a specified resource.  POST provides an item for the specified resource to work with.

For a practical example, POST would be used to tell an account to add a new note to it, while PUT would be used to edit the contents of a note.  PUT can’t do an “append” operation: if you POST your ‘add a note’ twice you have two notes; if you PUT the same note twice you still have just the one note in the end.
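
Here’s a minimal sketch of that note example as raw HTTP requests in C#.  The http://example.com/accounts/42/notes and http://example.com/notes/770 URLs, and the idea that the note service accepts plain text, are made up purely for illustration:

using System;
using System.IO;
using System.Net;
using System.Text;

class PutVsPostExample
{
    static void Main()
    {
        // POST: ask the account resource to create a new note under it.
        // Repeating this request creates a second note.
        Send("POST", "http://example.com/accounts/42/notes", "Call the customer back");

        // PUT: replace the contents of one specific, known note.
        // Repeating this request still leaves just the one note in place.
        Send("PUT", "http://example.com/notes/770", "Call the customer back on Tuesday");
    }

    static void Send(string method, string url, string body)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.ContentType = "text/plain";
        byte[] data = Encoding.UTF8.GetBytes(body);
        request.ContentLength = data.Length;
        using (Stream s = request.GetRequestStream())
        {
            s.Write(data, 0, data.Length);
        }
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            Console.WriteLine("{0} {1} -> {2}", method, url, (int)response.StatusCode);
        }
    }
}

Re-running the POST piles up note after note; re-running the PUT just overwrites note #770 with the same content, which is the idempotency difference in practice.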

I think it’s really important to consider both of these things together – the side effects are different, and the entity you specify is different – PUT says “this resource is this data”; POST says “this resource should work with this data”.

See http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html for a primary source.


Posted by David Eison on Wednesday, December 8, 2010 2:40 AM

CRM entity browser appears unreliable

CRM has a nice entity browser that is supposed to show you info about your entities, like what fields there are and how long they can be.  You go to your server, add /sdk/list.aspx to the URL, and it shows you what entities you have and what their attributes are.

I used to rely on it because, as a programmer, it’s the quickest and most convenient way to get the length information and the ‘valid for update’ information about fields.  Unfortunately, it’s apparently unreliable.

The screenshot below is from a client site.  They only have one CRM organization configured, yes, I’m pointing at the right server in both windows, and yes, their contact entity is published.  But the entity browser does not show that the fields have had their sizes reduced (to better fit on mailing labels), and if I relied on it to write my code for working with addresses, I’d get it all wrong.

[Screenshot: the entity browser showing stale field lengths for the contact entity]


Posted by David Eison on Friday, December 3, 2010 12:16 PM

Loading a CRM page from a post-build event

CRM uses NTLM authentication, so you can’t just pull down a page with any simple tool.

Luckily, “curl” is a nice command-line tool for loading webpages that supports NTLM.

The only tricky part is that not all curl builds support the “--ntlm” and ”-u :” features you need for NTLM to work.  The Win32 Generic 7.21.2 binary build at http://curl.haxx.se/download.html is working for me (the MSVC build supports NTLM, but “-u :” silently doesn’t work; you can tell it failed because you get back a 401).

So I now put a copy of curl in my project’s references folder, and have added this to the post-build events that bounce IIS:

SETLOCAL ENABLEDELAYEDEXPANSION
set URL=http://localcrm/orgname/loader.aspx

. . .

set loop=0
:TRYCURL
rem Load front page to get app pool started up again
set /a loop=%loop%+1
echo Loading site to initialize app pool %URL%
rem curl should use --ntlm -u : to pull user from the environment.
"$(SolutionDir)\references\curl\curl" --user-agent "Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)" --location --location-trusted --ntlm -u : --silent --show-error -w"%%{http_code}" "%URL%"  > NUL
if !ERRORLEVEL! NEQ 0 GOTO CURLFAIL
goto DONE
:CURLFAIL
if %LOOP% LEQ 4 GOTO TRYCURLSLEEP
goto FAIL
:TRYCURLSLEEP
echo sleeping before retry
sleep 1
goto TRYCURL
:DONE
echo OK at %TIME%


Posted by David Eison on Sunday, November 28, 2010 2:14 AM

CRM API – Picklist details

Just thought I’d share a bug I ran into, in the hope that it helps you avoid it.

I see plenty of code that sets picklist values.  You might do this in JavaScript, or in the CRM API: set a new value, submit the update, and the picklist value is changed.

But it’s easy to miss that a picklist has two fields: a Name and a Value.  Value is something boring like 2; Name is something to show to the user, like “Critical”.

You would think that writing code like this was great:

Picklist prop = source.Properties[attribute] as Picklist;
if (prop == null || prop.IsNull)
{
    return defaultvalue;
}
return prop.name;

However, a problem crops up when dealing with client code written by people who didn’t realize that the name field matters.  If you put that code into a plugin, you get passed the name and value exactly as the API client specified them – so you could run into JavaScript code that only changed the value, or other CRM API code that only changed the value, and the name can be either completely unset or, worse, set to a stale previous value.

So, when writing server-side code that handles data submitted by the client, it looks like you’ll need to trust only the value and ignore the name.  It’s possible you could audit your client code and make sure that everywhere a new value is set a new name is set too... but one day somebody will find some code from another project, add it to yours, and your picklist handling will be wrong.
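
The defensive counterpart to the snippet above is a minimal sketch like the following, assuming the same source/attribute/defaultvalue variables and the same CRM 4.0 SDK Picklist type; only the integer Value is trusted:

// Trust only the integer Value; treat name as display-only data.
Picklist prop = source.Properties[attribute] as Picklist;
if (prop == null || prop.IsNull)
{
    return defaultvalue;
}
// prop.name may be unset or stale if a client only updated Value,
// so look the label up yourself (e.g. from metadata) when you need one.
return prop.Value;

Note that the default and return types become the underlying integer option value rather than a display string.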

Happy Thanksgiving!


Posted by David Eison on Saturday, November 27, 2010 11:57 PM

CRM and ViewState

Microsoft Dynamics CRM 4.0 doesn’t use ViewState or Sessions.  Indeed, they are disabled in the web.config file.  This can be a bit of a surprise to an ASP.NET developer working on a custom page in the /ISV directory.

Three possible solutions for ViewState:

1) Don’t use ViewState. 

This requires re-initializing any data with every postback.  One thing you can do is minimize postbacks.  You’ll see CRM does this in many places by encouraging the use of JavaScript and popping extra windows to handle immediate-response things like filling in or validating a lookup field.  If you are displaying a grid, it will be empty after a postback because it wasn’t re-populated from viewstate, so you need to repopulate it on every request – which means you also need to deal with the possibility that values may have changed because another user edited records in the meantime.  Some simple strategies to start with are to refer to records by GUIDs instead of row numbers, to limit updates to only the fields your user actually changed, and to code defensively.
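
For instance, here is a minimal sketch of what a no-viewstate /ISV page code-behind might look like; the ContactsGrid control and the LoadContacts data call are hypothetical stand-ins:

// With ViewState off, the grid must be rebuilt on every request,
// so there is deliberately no "if (!IsPostBack)" guard here.
protected void Page_Load(object sender, EventArgs e)
{
    // Key rows by GUID rather than row index so a concurrent edit by
    // another user doesn't shift which record you operate on.
    ContactsGrid.DataKeyNames = new[] { "contactid" };
    ContactsGrid.DataSource = LoadContacts();   // hypothetical data-access call
    ContactsGrid.DataBind();
}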

2) You can enable viewstate for a particular page by adding it to the page directive at the top of the page:

<%@ Page . . . EnableViewState="true" . . .

Note that if you have a server cluster and use viewstate, you’ll either need to set a machine key in web.config or else disable viewstate validation with another Page directive:

<%@ Page . . . EnableViewState="true" EnableViewStateMac="false" %>

Disabling ViewStateMac means your users will be able to tamper with the viewstate, so keep that concern in mind if you have custom permissioning rules in your app beyond CRM’s built-in permissioning.

3) You can enable viewstate for all of your ISV pages by setting it up as its own app. 

Create a virtual directory under /ISV, point it to your pages, and give yourself a web.config that sets enableViewState for your pages (and a machine key).  See, for example, this guide at xrmlinq.

Remember that if you’re going to use viewstate, you should keep an eye on viewstate size; perhaps you don’t want to transfer a megabyte of serialized grid data with every page load.  You can see how big viewstate is by viewing the source of your page, or by adding some code to your pages (if Request.IsLocal is true and DEBUG is defined, I tack on a label with the size from LosFormatter; see the example code at scottonwriting); but in general, if you use viewstate on a grid or listbox, you’re going to be storing a lot of data.  If you’d rather repopulate your grid or listbox with every request instead of serializing their data, simply databind them before viewstate begins being tracked - during Init instead of during Load.  See the ASP.NET Page Lifecycle.
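
A minimal sketch of that debug-only size check, written as an override in a page (or base-page) class; exactly where you surface the number, label or trace output, is up to you:

#if DEBUG
// Debug-only: report how large the serialized viewstate for this page is.
protected override void SavePageStateToPersistenceMedium(object viewState)
{
    if (Request.IsLocal)
    {
        var writer = new System.IO.StringWriter();
        new LosFormatter().Serialize(writer, viewState);
        // Surface it however you like; ASP.NET trace output works as well as a label.
        Trace.Write("ViewState size", writer.ToString().Length + " characters");
    }
    base.SavePageStateToPersistenceMedium(viewState);
}
#endif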

As for Sessions, I would recommend avoiding them.  You’ll probably have some real headaches if you need to support the offline client and rely on sessions, and it can be hard to anticipate (and test for) how sessions will be affected by one user popping several windows open.  In general, if data is ephemeral it can be handled well by viewstate.  If data is not ephemeral, you probably want to be storing it in a database.  So pass on session.


Posted by David Eison on Tuesday, October 19, 2010 3:04 PM

Silverlight Issues

Silverlight often feels a bit like a work in progress, and other times like parts just weren’t thought through all the way.  I thought I’d post a few examples of the sorts of things we’ve run into while developing with it.  The main lesson is that, somehow, you will need to plan for the unplanned.

For example, in the global application error handler:

Exception tmp = e.ExceptionObject;
if (tmp != null)
{
    while (tmp.InnerException != null)
    {
        tmp = tmp.InnerException;
    }

    // Silverlight defect where File stream can error out if file open failed due 
    // to an open file error, then later the garbage collector tries to clean up
    // the stream.  Can't be reliably handled at regular code level because we don't
    // get back a stream to operate on due to the file method erroring out.
    if (tmp is InvalidOperationException)
    {
        // string parse is bad and fragile, but I can't find any way to get the error code
        // this should be error code -2146233079 per Reflector, but, well.
        if (tmp.Message.Contains("UI Thread") && tmp.Message.Contains("System.Windows.SaveFileStream.Dispose"))
        {
            // ignore
            return;
        }
    }
. . .

Forums suggest this one may have been fixed in Silverlight 4.

My favorite issue is a layout issue: if you want a label next to a textbox, you put them in a horizontal StackPanel.  But horizontal StackPanels don’t constrain their children horizontally.  Bottom line?  Your labels won’t wrap; they will clip instead.   Solution?  Specify fixed widths on your labels.  I wonder if nobody ever built a form with labels next to textboxes and a dynamic layout during the testing and design phase.

We also have these gems:

                // NOTE: A "Dialogs must be user initiated" error can crop up if running under the debugger.
                // It is a spurious error and will not happen when not using the debugger.
                // See http://forums.silverlight.net/forums/t/82454.aspx

 

// firefox doesn't render our silverlight right in an iframe.
function isfirefox() {
    if (/Firefox[\/\s](\d+\.\d+)/.test(navigator.userAgent)) {
        return true;
    } else {
        return false;
    }
}

 

/*
    This service has to be activated using basicHttp because Silverlight cannot 
    function against wsHttp.  
*/

 

        //sizechanged doesn't fire properly when column widths are changed, tabs need to
        // be hidden and refreshed for it to update... also, this can be fired
        // too early, before the column width has really changed. Layoutupdated seems to mostly work, though.

 

// assigning width directly is sub-optimal because
// it won't handle a situation where this column should be bigger
// than the main column... but the grid doesn't redraw properly
// if we don't.  Tried forcing just minwidth to work 
// with invalidatemeasure, invalidatearrange,
// measure, arrange, and updatelayout, all of which redraw the column
// header right but not the column data.

Categories: ASP.NET | Silverlight
Posted by David Eison on Sunday, September 26, 2010 5:59 PM